-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: multi column stats #160
Conversation
let mut rows = Vec::<Vec<Option<Value>>>::new(); | ||
for (col, col_type) in cols.iter().zip(col_types.iter()) { | ||
for (idx, val) in Self::to_typed_column(col, col_type).iter().enumerate() { | ||
rows[idx].push(val.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause out-of-bound access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, i'll add some tests
let (fully_null_rows, non_fully_null_rows): (Vec<_>, Vec<_>) = rows | ||
.into_iter() | ||
.partition(|row| row.iter().all(|v| v.is_none())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This operation might be very costly, as all the rows are copied once more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right... well the previous operation had a "flatten" and a "collect". that also copied stuff no?
pub fn extend_with_multi_column_stats< | ||
I: IntoIterator<Item = Result<RecordBatch, ArrowError>>, | ||
>( | ||
&mut self, | ||
batch_iter_builder: impl Fn() -> anyhow::Result<RecordBatchIterator<I>>, | ||
combinations: Vec<Vec<usize>>, | ||
) -> anyhow::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you plan to use this function? Looks like each invocation is two more scans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each invocation is len(combinations) x 2 scans, right now, for simplicity
it can be made x2 scans only
actually, i will fully refactor all this before merging anything
@@ -103,7 +105,7 @@ pub struct PerColumnStats<M: MostCommonValues, D: Distribution> { | |||
|
|||
// distribution _does not_ include the values in mcvs | |||
// distribution _does not_ include nulls | |||
pub distr: D, | |||
pub distr: Option<D>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we separate PerColumnStats
and CombColumnStats
? Just in case they diverge further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, probably. also has a better naming
This PR adds support for multi column statistics & minor bug-fixes.